Abstract. The logit model is often used to analyze experimental data.
However, randomization does not justify the model, so the usual esti- 
mators can be inconsistent. A consistent estimator is proposed. Ney- 
man's non-parametric setup is used as a benchmark. In this setup, 
each subject has two potential responses, one if treated and the other 
if untreated; only one of the two responses can be observed. Besides
the mathematics, there are simulation results, a brief review of the 
literature, and some recommendations for practice. 

Key words and phrases: Models, randomization, logistic regression, 
logit, average predicted probability. 
1. INTRODUCTION

The logit model is often fitted to experimental 
data. As explained below, randomization does not 
justify the assumptions behind the model. Thus, the 
conventional estimator of log odds is difficult to in- 
terpret; an alternative will be suggested. Neyman's 
setup is used to define parameters and prove re- 
sults. (Grammatical niceties apart, the terms "logit 
model" and "logistic regression" are used interchange- 
ably.) 

After explaining the models and estimators, we 
present simulations to illustrate the findings. A brief 
review of the literature describes the history and 
current usage. Some practical recommendations are 
derived from the theory. Analytic proofs are sketched 
at the end of the paper. 

2. NEYMAN 

There is a study population with n subjects indexed by i = 1, ..., n.
Fix $\pi_T$ with $0 < \pi_T < 1$. Choose $n\pi_T$ subjects at random and
assign them to the treatment condition. The remaining $n\pi_C$ subjects
are assigned to a control condition, where $\pi_C = 1 - \pi_T$.
According to Neyman (1923), each subject has two responses: $Y_i^T$ if
assigned to treatment, and $Y_i^C$ if assigned to control. The responses
are 1 or 0, where 1 is "success" and 0 is "failure." Responses are
fixed, that is, not random.

If i is assigned to treatment (T), then $Y_i^T$ is observed. Conversely,
if i is assigned to control (C), then $Y_i^C$ is observed. Either one of
the responses may be observed, but not both. Thus, responses are
subject-level parameters. Even so, responses are estimable (see
Section 9). Each subject has a covariate $Z_i$, unaffected by
assignment; $Z_i$ is observable. In this setup, the only stochastic
element is the randomization: conditional on the assignment variable
$X_i$, the observed response $Y_i = X_i Y_i^T + (1 - X_i) Y_i^C$ is
deterministic.

Population-level ITT (intention-to-treat) parameters are defined by
taking averages over all n subjects in the study population:

(1)  $\alpha^T = \frac{1}{n}\sum_{i=1}^{n} Y_i^T, \qquad \alpha^C = \frac{1}{n}\sum_{i=1}^{n} Y_i^C.$

For example, $\alpha^T$ is the fraction of successes if all subjects are
assigned to T; similarly for $\alpha^C$. A parameter of considerable
interest is the differential log odds of success,

(2)  $\Delta = \log\frac{\alpha^T}{1 - \alpha^T} - \log\frac{\alpha^C}{1 - \alpha^C}.$

The logit model is all about log odds (more on this below). The
parameter $\Delta$ defined by (2) may therefore be what investigators
think is estimated by running logistic regressions on experimental
data, although that idea is seldom explicit.
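
As a rough illustration of definitions (1) and (2), the following sketch computes $\alpha^T$, $\alpha^C$ and $\Delta$ for a small made-up population. The eight response pairs are hypothetical, as are the function names. In a real experiment only one response per subject is observed, so this is a computation of the parameter, not of an estimator.

```python
import math

def log_odds(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def differential_log_odds(y_treat, y_control):
    """Delta in (2), computed from the full lists of potential responses."""
    n = len(y_treat)
    alpha_t = sum(y_treat) / n    # fraction of successes if all are assigned to T
    alpha_c = sum(y_control) / n  # fraction of successes if all are assigned to C
    return log_odds(alpha_t) - log_odds(alpha_c)

# Hypothetical population of 8 subjects (illustrative values only).
y_treat = [1, 1, 1, 0, 1, 1, 0, 1]    # alpha^T = 6/8
y_control = [1, 0, 1, 0, 0, 1, 0, 1]  # alpha^C = 4/8
delta = differential_log_odds(y_treat, y_control)
```

Here the odds are 3 under treatment and 1 under control, so `delta` is log 3.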

The Intention-to-Treat Principle

The intention-to-treat principle, which goes back 
to Hill (1961, page 259), is to make comparisons 
based on treatment assigned rather than treatment 
received. Such comparisons take full advantage of 
the randomization, thereby avoiding biases due to 
self-selection. For example, the unbiased estimators 
for the parameters in (1) are the fraction of successes 
in the treatment group and the control group, re- 
spectively. Below, these will be called ITT estima- 
tors. ITT estimators measure the effect of assign- 
ment rather than treatment. With crossover, the 
distinction matters. For additional discussion, see 
Freedman (2006a). 

3. THE LOGIT MODEL 

To set up the logit model, we consider a study population of n
subjects, indexed by i = 1, ..., n. Each subject has three observable
random variables: $Y_i$, $X_i$, $Z_i$. Here, $Y_i$ is the response,
which is 0 or 1. The primary interest is the "effect" of $X_i$ on
$Y_i$, and $Z_i$ is a covariate.

For our purposes, the best way to formulate the model involves a latent
(unobservable) random variable $U_i$ for each subject. These are
assumed to be independent across subjects, with a common logistic
distribution: for $-\infty < u < \infty$,

(3)  $P(U_i < u) = \exp(u)/[1 + \exp(u)]$,

where $\exp(u) = e^u$. The model assumes that X and Z are exogenous,
that is, independent of U. More formally, $\{X_i, Z_i : i = 1, \ldots, n\}$
is assumed to be independent of $\{U_i : i = 1, \ldots, n\}$. Finally,
the model assumes that $Y_i = 1$ if

$\beta_1 + \beta_2 X_i + \beta_3 Z_i + U_i > 0$;

else, $Y_i = 0$.

Given X and Z, it follows that responses are independent across
subjects, the conditional probability that $Y_i = 1$ being
$p(\beta, X_i, Z_i)$, where

(4)  $p(\beta, x, z) = \frac{\exp(\beta_1 + \beta_2 x + \beta_3 z)}{1 + \exp(\beta_1 + \beta_2 x + \beta_3 z)}$.

(To verify this, check first that $-U_i$ is distributed like $+U_i$.)
The parameter vector $\beta = (\beta_1, \beta_2, \beta_3)$ is usually
estimated by maximum likelihood. We denote the MLE by $\hat\beta$.

Interpreting the Coefficients in the Model 

In the case of primary interest, $X_i$ is 1 or 0. Consider the log odds
$\lambda_i^T$ of success when $X_i = 1$, as well as the log odds
$\lambda_i^C$ when $X_i = 0$. In view of (4),

(5)  $\lambda_i^T = \log\frac{p(\beta, 1, Z_i)}{1 - p(\beta, 1, Z_i)} = \beta_1 + \beta_2 + \beta_3 Z_i$,
     $\lambda_i^C = \log\frac{p(\beta, 0, Z_i)}{1 - p(\beta, 0, Z_i)} = \beta_1 + \beta_3 Z_i$.

In particular, $\lambda_i^T - \lambda_i^C = \beta_2$ for all i, whatever
the value of $Z_i$ may be. Thus, according to the model, $X_i = 1$ adds
$\beta_2$ to the log odds of success.
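
The constant shift in (5) is easy to confirm numerically. The sketch below evaluates the success probability (4) at a few covariate values and checks that the difference in log odds is always the second coefficient. The coefficient values and the function names are assumptions chosen only for illustration.

```python
import math

def p(beta, x, z):
    """Success probability in (4) for the logit model."""
    eta = beta[0] + beta[1] * x + beta[2] * z
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_odds(prob):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(prob / (1.0 - prob))

beta = (-1.0, 0.8, 2.0)  # hypothetical (beta_1, beta_2, beta_3)

# lambda_i^T - lambda_i^C, evaluated at several covariate values.
shifts = [log_odds(p(beta, 1, z)) - log_odds(p(beta, 0, z))
          for z in (0.0, 0.3, 0.9)]
```

Every entry of `shifts` equals 0.8 up to rounding error, whatever the covariate value.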

Application to Experimental Data 

To apply the model to experimental data, define $X_i = 1$ if i is
assigned to T, while $X_i = 0$ if i is assigned to C. Notice that the
model is not justified by randomization. Why would the logit
specification be correct rather than the probit, or anything else? What
justifies the choice of covariates? Why are they exogenous? If the
model is wrong, what is $\hat\beta_2$ supposed to be estimating? The
last rhetorical question may have an answer: the parameter $\Delta$ in
(2) seems like a natural choice, as indicated above.

More technically, from Neyman's perspective, given the assignment
variables $\{X_i\}$, the responses are deterministic: $Y_i = Y_i^T$ if
$X_i = 1$, while $Y_i = Y_i^C$ if $X_i = 0$. The logit model, on the
other hand, views the responses $\{Y_i\}$ as random, with a specified
distribution, given the assignment variables and covariates.

The contrast is therefore between two styles of 
inference. 

• Randomization provides a known distribution for 
the assignment variables; statistical inferences are 
based on this distribution. 

• Modeling assumes a distribution for the latent 
variables; statistical inferences are based on that 
assumption. Furthermore, model-based inferences 
are conditional on the assignment variables and 
covariates. 

A similar contrast will be found in other areas too, 
including sample surveys. See Koch and Gillings (2005) 
for a review and pointers to the literature. 
What if the Logit Model is Right?

Suppose the model is right, and there is a causal interpretation. We
can intervene and set $X_i$ to 1 without changing the Z's or U's, so
$Y_i = 1$ if and only if $\beta_1 + \beta_2 + \beta_3 Z_i + U_i > 0$.
Similarly, we can set $X_i$ to 0 without changing anything else, and
then $Y_i = 1$ if and only if $\beta_1 + \beta_3 Z_i + U_i > 0$. Notice
that $\beta_2$ appears when $X_i$ is set to 1, but disappears when
$X_i$ is set to 0.

On this basis, for each subject, whatever the value of $Z_i$ may be,
setting $X_i$ to 1 rather than 0 adds $\beta_2$ to the log odds of
success. If the model is right, $\beta_2$ is a very useful parameter,
which is well estimated by the MLE provided n is large. For additional
detail on causal modeling and estimation, see Freedman (2005).

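Even if the model is right and n is large, $\beta_2$ differs from
$\Delta$ in (2). For instance, $\alpha^T$ will be nearly equal to
$\frac{1}{n}\sum_{i=1}^{n} p(\beta, 1, Z_i)$. So
$\log\alpha^T - \log(1 - \alpha^T)$ will be nearly equal to

(6)  $\log\Bigl(\frac{1}{n}\sum_{i=1}^{n} p(\beta, 1, Z_i)\Bigr) - \log\Bigl(\frac{1}{n}\sum_{i=1}^{n} [1 - p(\beta, 1, Z_i)]\Bigr)$.

Likewise, $\log\alpha^C - \log(1 - \alpha^C)$ will be nearly equal to

(7)  $\log\Bigl(\frac{1}{n}\sum_{i=1}^{n} p(\beta, 0, Z_i)\Bigr) - \log\Bigl(\frac{1}{n}\sum_{i=1}^{n} [1 - p(\beta, 0, Z_i)]\Bigr)$.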

Taking the log of an average, however, is quite different from taking
the average of the logs. The former is relevant for $\Delta$ in (2), as
shown by (6)-(7); the latter for computing

(8)  $\beta_2 = \frac{1}{n}\sum_{i=1}^{n} (\lambda_i^T - \lambda_i^C)$,

where the log odds of success $\lambda_i^T$ and $\lambda_i^C$ were
computed in (5).

The difference between averaging inside and outside the logs may be
surprising at first, but in the end, that difference is why you should
put confounders like Z into the equation, if you believe the model.
Section 9 below gives further detail, and an inequality relating
$\beta_2$ to $\Delta$.

From Neyman to Logits

How could we get from Neyman to the logit model? To begin with, we
would allow $Y_i^T$ and $Y_i^C$ to be 0-1 valued random variables; the
$Z_i$ can be random too. To define the parameters in (1) and (2), we
would replace $Y_i^T$ and $Y_i^C$ by their expectations. None of this
is problematic, and the Neyman model is now extremely general and
flexible. Randomization makes the assignment variables $\{X_i\}$
independent of the potential responses $Y_i^T$, $Y_i^C$.

To get the logit model, however, we would need to specialize this setup
considerably, assuming the existence of IID logistic random variables
$U_i$, independent of the covariates $Z_i$, with

(9)  $Y_i^T = 1$ if and only if $\beta_1 + \beta_2 + \beta_3 Z_i + U_i > 0$,
     $Y_i^C = 1$ if and only if $\beta_1 + \beta_3 Z_i + U_i > 0$.

Besides (9), the restrictive assumptions are the following:

(i) The $U_i$ are independent of the $Z_i$.

(ii) The $U_i$ are independent across subjects i.

(iii) The $U_i$ have a common logistic distribution.

If you are willing to make these assumptions, what randomization
contributes is a guarantee that the assignment variables $\{X_i\}$ are
independent of the latent variables $\{U_i\}$. Randomization does not
guarantee the existence of the $U_i$, or the truth of (9), or the
validity of (i)-(iii).

4. A PLUG-IN ESTIMATOR FOR THE LOG ODDS

If a logit model is fitted to experimental data, average predicted
probabilities are computed by plugging $\hat\beta$ into (4):

(10a)  $\tilde\alpha^T = \frac{1}{n}\sum_{i=1}^{n} p(\hat\beta, 1, Z_i), \qquad \tilde\alpha^C = \frac{1}{n}\sum_{i=1}^{n} p(\hat\beta, 0, Z_i)$.

(The tilde notation is needed; $\hat\alpha^T$ and $\hat\alpha^C$ will
make their appearances momentarily.) Then the differential log odds in
(2) can be estimated by plugging into the formula for $\Delta$:

(10b)  $\tilde\Delta = \log\frac{\tilde\alpha^T}{1 - \tilde\alpha^T} - \log\frac{\tilde\alpha^C}{1 - \tilde\alpha^C}$.

As will be seen below, $\tilde\Delta$ is consistent.

The ITT (intention-to-treat) estimators are defined as follows:

(11a)  $\hat\alpha^T = \frac{1}{n_T}\sum_{i \in T} Y_i, \qquad \hat\alpha^C = \frac{1}{n_C}\sum_{i \in C} Y_i$,

where $n_T = n\pi_T$ is the number of subjects in T and $n_C = n\pi_C$
is the number of subjects in C. Then

(11b)  $\hat\Delta = \log\frac{\hat\alpha^T}{1 - \hat\alpha^T} - \log\frac{\hat\alpha^C}{1 - \hat\alpha^C}$.

The ITT estimators are consistent too, with asymptotics discussed in
Freedman (2008a, 2008b). The intuition: $\hat\alpha^T$ is the average
success rate in the treatment group, and the sample average is a good
estimator for the population average. The same reasoning applies to
$\hat\alpha^C$.
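
The plug-in and ITT estimators of this section can be sketched in a few dozen lines. The code below is a minimal illustration, not the procedure used for the results reported in the paper: the Newton-Raphson fitter, the helper names, the sample size of 400, the seed, and the data-generating rule are all assumptions made for the example. It fits the logit model with regressors $(1, X_i, Z_i)$, then computes the plug-in and ITT estimates of the differential log odds.

```python
import math
import random

def p(beta, x, z):
    """Success probability (4) under the logit model."""
    eta = beta[0] + beta[1] * x + beta[2] * z
    return 1.0 / (1.0 + math.exp(-eta))

def log_odds(q):
    return math.log(q / (1.0 - q))

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= f * m[col][k]
    s = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s[r] = (m[r][3] - sum(m[r][k] * s[k] for k in range(r + 1, 3))) / m[r][r]
    return s

def fit_logit(xs, zs, ys, steps=30):
    """Newton-Raphson for the (pseudo-)MLE with regressors (1, X_i, Z_i)."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for x, z, y in zip(xs, zs, ys):
            row = (1.0, x, z)
            pi = p(beta, x, z)
            for j in range(3):
                grad[j] += row[j] * (y - pi)
                for k in range(3):
                    info[j][k] += row[j] * row[k] * pi * (1.0 - pi)
        beta = [b + s for b, s in zip(beta, solve3(info, grad))]
    return beta

def plug_in(beta, zs):
    """Plug-in estimate (10b): average predicted probabilities over ALL subjects."""
    a_t = sum(p(beta, 1, z) for z in zs) / len(zs)
    a_c = sum(p(beta, 0, z) for z in zs) / len(zs)
    return log_odds(a_t) - log_odds(a_c)

def itt(xs, ys):
    """ITT estimate (11b): observed success rates in T and in C."""
    t = [y for x, y in zip(xs, ys) if x == 1]
    c = [y for x, y in zip(xs, ys) if x == 0]
    return log_odds(sum(t) / len(t)) - log_odds(sum(c) / len(c))

# Hypothetical experiment: 400 subjects, 300 assigned to T at random.
rng = random.Random(0)
n = 400
us = [rng.random() for _ in range(n)]
zs = [rng.random() for _ in range(n)]
treated = set(rng.sample(range(n), 300))
xs = [1 if i in treated else 0 for i in range(n)]
# Observed response: 1{U + Z > 3/4} if treated, 1{U > 1/2} if not (not a logit model).
ys = [(1 if u + z > 0.75 else 0) if x == 1 else (1 if u > 0.5 else 0)
      for u, z, x in zip(us, zs, xs)]

beta_hat = fit_logit(xs, zs, ys)
delta_tilde = plug_in(beta_hat, zs)  # plug-in estimate
delta_hat = itt(xs, ys)              # ITT estimate
```

The plug-in estimate averages fitted probabilities over all n subjects, while the ITT estimate uses only observed rates within each group. The two need not coincide in a finite sample, because the covariate is only approximately balanced between T and C.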

5. SIMULATIONS 

The simulations in this section are designed to 
show what happens when the logit model is fitted 
to experimental data. The data generating mech- 
anism is not the logit, so the simulations illustrate 
the consequences of specification error. The stochas- 
tic element is the randomization, as in Section 2. 
(Some auxiliary randomness is introduced to con- 
struct the individual-level parameters, but that gets 
conditioned away.) Let n = 100, 500, 1000, 5000. For i = 1, ..., n:

let $U_i$, $V_i$ be IID uniform random variables,
let $Z_i = V_i$,
let $Y_i^C = 1$ if $U_i > 1/2$, else $Y_i^C = 0$,
let $Y_i^T = 1$ if $U_i + V_i > 3/4$, else $Y_i^T = 0$.

Suppose n is very large. The mean response in the control condition is
around $P(U_i > 1/2) = 1/2$, so the odds of success in the control
condition are around 1. (The qualifiers are needed because the $U_i$
are chosen at random.) The mean response in the treatment condition is
around 23/32, because

$P(U_i + V_i < 3/4) = (1/2) \times (3/4)^2 = 9/32$.

So the odds of success in the treatment condition are around
(23/32)/(9/32). The parameter $\Delta$ in (2) will therefore be around

$\log\frac{23/32}{9/32} - \log 1 = \log\frac{23}{9} = 0.938$.

Even for moderately large n, non-linearity in (2) is an issue, and the
approximation given for $\Delta$ is unsatisfactory.



Table 1

Simulations for n = 100, 500, 1000, 5000. Twenty-five percent of the
subjects are assigned at random to C, the rest to T. Averages and SDs
are shown for the MLE $\hat\beta$ and the plug-in estimator
$\tilde\Delta$, as well as the true value of the differential log odds
$\Delta$ defined in (2). There are 1,000 simulated experiments for
each n.

  n       $\hat\beta_1$   $\hat\beta_2$   $\hat\beta_3$   Plug-in   Truth
  100        -0.699          1.344           2.327         1.248    1.245
   SD         0.457          0.540           0.621         0.499
  500        -1.750          1.263           3.318         1.053    1.053
   SD         0.214          0.234           0.227         0.194
  1000       -1.568          1.046           3.173         0.885    0.883
   SD         0.155          0.169           0.154         0.142
  5000       -1.676          1.134           3.333         0.937    0.939
   SD         0.071          0.076           0.072         0.062



The construction produces individual-level varia- 
tion: a majority of subjects are unaffected by treat- 
ment, about 1/4 are helped, about 1/32 are harmed. 
The covariate is reasonably informative about the 
effect of treatment — if Zi is big, treatment is likely 
to help. 
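
The construction can be checked by building one large population and tabulating it. This is a sketch only: the population size of 200,000, the seed, and the function name are assumptions, and the code builds the fixed potential responses without simulating any randomized experiments.

```python
import math
import random

def build_population(n, seed=0):
    """Return a list of (Z_i, Y_i^C, Y_i^T) built from IID uniforms U_i, V_i."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        y_c = 1 if u > 0.5 else 0        # response if assigned to control
        y_t = 1 if u + v > 0.75 else 0   # response if assigned to treatment
        pop.append((v, y_c, y_t))
    return pop

pop = build_population(200_000)
n = len(pop)
alpha_c = sum(y_c for _, y_c, _ in pop) / n             # near 1/2
alpha_t = sum(y_t for _, _, y_t in pop) / n             # near 23/32
helped = sum(1 for _, y_c, y_t in pop if y_t > y_c) / n  # near 1/4
harmed = sum(1 for _, y_c, y_t in pop if y_t < y_c) / n  # near 1/32
delta = math.log(alpha_t / (1 - alpha_t)) - math.log(alpha_c / (1 - alpha_c))
```

With a population this large, the fractions land close to the limiting values in the text, and `delta` is close to log(23/9), about 0.938.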



Having constructed $Z_i$, $Y_i^C$ and $Y_i^T$ for i = 1, ..., n, we
freeze them, and simulate 1000 randomized controlled experiments, where
25% of the subjects are assigned to C and 75% to T. We fit a logit
model to the data generated by each experiment, computing the MLE
$\hat\beta$ and the plug-in estimator $\tilde\Delta$ defined by (10b).
The average of the 1000 $\hat\beta$'s and $\tilde\Delta$'s is shown in
Table 1, along with the true value of the differential log odds,
namely, $\Delta$ in (2). We distinguish between the standard deviation
(SD) and the standard error (SE). Below each average, the table shows
the corresponding SD.

For example, with n = 100, the average of the 1000 $\hat\beta_2$'s is
1.344; the SD is 0.540; the Monte Carlo SE in the average is therefore
$0.540/\sqrt{1000} = 0.017$. The average of the 1000 plug-in estimates
is 1.248, and the true $\Delta$ is 1.245. When n = 5000, the bias in
$\hat\beta_2$ as an estimator of $\Delta$ is 1.134 - 0.939 = 0.195,
with a Monte Carlo SE of $0.076/\sqrt{1000} = 0.002$. There is a
confusion to avoid: n is the number of subjects in the study
population, varying from 100 to 5000, but the number of simulated
experiments is fixed at 1000. (The Monte Carlo SE measures the impact
of randomness in the simulation, which is based on a sample of "only"
1000 observations.)
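
The arithmetic in the preceding paragraph can be reproduced directly from the Table 1 entries. The helper name below is an assumption made for the example.

```python
import math

def monte_carlo_se(sd, replications):
    """SE of an average over independent simulated experiments."""
    return sd / math.sqrt(replications)

se_n100 = monte_carlo_se(0.540, 1000)   # SE for the average MLE of beta_2, n = 100
se_n5000 = monte_carlo_se(0.076, 1000)  # SE for the average MLE of beta_2, n = 5000
bias_n5000 = 1.134 - 0.939              # average MLE of beta_2 minus true Delta
```

Note that the divisor is the number of simulated experiments (1000), not the number n of subjects.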

The plug-in estimator is essentially unbiased and less variable than
$\hat\beta_2$. The true value of $\Delta$ changes from one n to the
next, since values of $Y_i^C$, $Y_i^T$ are generated by Monte Carlo for
each n. Even with n = 5000, the true value of $\Delta$ would change
from one run to another, the SD across runs being about 0.03 (not shown
in the table).

Parameter choices (for instance, the joint distribution of
$(U_i, V_i)$) were somewhat arbitrary. Surprisingly, bias depends on
the fraction of subjects assigned to T. On the other hand, changing the
cutpoints used to define $Y_i^C$ and $Y_i^T$ from 1/2 and 3/4 to 0.95
and 1.5 makes little difference to the performance of $\hat\beta_2$ and
the plug-in estimator. In these examples, the plug-in estimator and the
ITT estimators are essentially unbiased; the latter has slightly
smaller variance.

The bias in $\hat\beta_2$ depends very much on the covariate. For
instance, if the covariate is $U_i + V_i$ rather than $V_i$, then
$\hat\beta_2$ hovers around 3. Truth remains in the vicinity of 1, so
the bias in $\hat\beta_2$ is huge. The plug-in and ITT estimators
remain essentially unbiased, with variances much smaller than
$\hat\beta_2$; the ITT estimator has higher variance than the plug-in
estimator (data not shown for variations on the basic setup, or ITT
estimators).

The Monte Carlo results suggest the following: 

(i) As n gets large, the MLE stabilizes. 

(ii) The plug-in estimator $\tilde\Delta$ is a good estimator of the
differential log odds $\Delta$.

(iii) $\hat\beta_2$ tends to over-estimate $\Delta > 0$.

These points will be verified analytically below. 

6. EXTENSIONS AND IMPLICATIONS 

Suppose the differential log odds of success is the parameter to be
estimated. Then $\hat\beta_2$ is generally the wrong estimator to use,
whether the logit model is right or the logit model is wrong (Section 9
has a mathematical proof). It is better to use the plug-in estimator
(10) or the ITT estimator (11). These estimators are nearly unbiased,
and in many examples have smaller variances too.
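
A small numerical sketch may make the point concrete for the case where the logit model is right. It evaluates the large-n approximations (6) and (7) for a hypothetical coefficient vector and a hypothetical set of covariate values spread evenly over (0, 1), then compares the resulting differential log odds with $\beta_2$. All numbers and names are assumptions chosen for illustration.

```python
import math

def p(beta, x, z):
    """Success probability (4) under the logit model."""
    eta = beta[0] + beta[1] * x + beta[2] * z
    return 1.0 / (1.0 + math.exp(-eta))

def log_odds(q):
    return math.log(q / (1.0 - q))

beta = (-1.5, 1.0, 3.0)                       # hypothetical (beta_1, beta_2, beta_3)
zs = [(i + 0.5) / 1000 for i in range(1000)]  # covariate values spread over (0, 1)

alpha_t = sum(p(beta, 1, z) for z in zs) / len(zs)  # average inside the log in (6)
alpha_c = sum(p(beta, 0, z) for z in zs) / len(zs)  # average inside the log in (7)
delta = log_odds(alpha_t) - log_odds(alpha_c)       # log of averages, as in (2)
```

In this example `delta` comes out near 0.855, noticeably below the value 1.0 used for $\beta_2$: averaging inside the logs pulls the differential log odds toward zero.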

Although details remain to be checked, the convergence arguments in
Section 8 seem to extend to probits, the parameter corresponding to (2)
being

$\Phi^{-1}(\alpha^T) - \Phi^{-1}(\alpha^C)$,

where $\Phi$ is the standard normal distribution function. On the other
hand, with the probit, the plug-in estimators are unlikely to be
consistent, since the analogs of the likelihood equations (16)-(18)
below involve weighted averages rather than simple averages.
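
For concreteness, the probit analog of (2) can be evaluated with the standard library's normal quantile function. The sketch uses the limiting success rates 23/32 and 1/2 from the simulation design in Section 5; the function name is an assumption.

```python
from statistics import NormalDist

def differential_probit(alpha_t, alpha_c):
    """Probit analog of (2): difference of standard normal quantiles."""
    inv = NormalDist().inv_cdf
    return inv(alpha_t) - inv(alpha_c)

value = differential_probit(23 / 32, 1 / 2)
```

Since the quantile at 1/2 is zero, `value` is just the standard normal quantile at 23/32.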



In simulation studies (not reported here), the probit behaves very much
like the logit, with the usual difference in scale: probit coefficients
are about 5/8 of their logit counterparts (Amemiya, 1981, page 1487).
Numerical calculations also confirm inconsistency of the plug-in
estimators, although the asymptotic bias is small.

According to the logit and probit models, if treat- 
ment improves the chances of success, it does so for 
all subjects. In reality, of course, treatment may help 
some subgroups and hurt others. Subgroup analysis 
can therefore be a useful check on the models. Con- 
sistency of the plug-in estimators — as defined here — 
does not preclude subgroup effects. 

Logit models, probit models, and their ilk are not 
justified by randomization. This has implications for 
practice. Rates and averages for the treatment and 
control groups should be compared before the mod- 
eling starts. If the models change the substantive 
results, that raises questions that need to be ad- 
dressed. 

There may be an objection that models take ad- 
vantage of additional information. The objection has 
some merit if the models are right or nearly right. 
On the other hand, if the models cannot be vali- 
dated, conclusions drawn from them must be shaky. 
"Cross-tabulation before regression" is a slogan to 
be considered. 

7. LITERATURE REVIEW 

Logit and probit models are often used to analyze 
experimental data. See Pate and Hamilton (1992), 
Gilens (2001), Hu (2003), Duch and Palmer (2004), 
Frey and Meier (2004), Gertler (2004). The plug-in 
estimator discussed here is similar to the "average 
treatment effect" sometimes reported in the litera- 
ture; see, for example, Evans and Schwab (1995). For 
additional discussion, see Lane and Nelder (1982), 
Brant (1996). 

Lim (1999) conjectured that plug-in estimators 
based on the logit model would be consistent, with 
an informal argument based on the likelihood equa- 
tion. He also conjectured inconsistency for the pro- 
bit. Middleton (2007) discusses inconsistent logit es- 
timators. 

The logistic distribution may first have been used 
to model population growth. See Verhulst (1845) 
and Yule (1925). Later, the distribution was used to 
model dose-response in bioassays (Berkson, 1944). 
An early biomedical application to causal inference is Truett,
Cornfield, and Kannel (1967). The history is considered further in
Freedman (2005). The present paper extends previous results on linear
regression (Freedman, 2008a, 2008b).

D. A. FREEDMAN

Statistical models for causation go back to Jerzy 
Neyman's work on agricultural experiments in the 
early part of the 20th century. The key paper, Ney- 
man (1923), was in Polish. There was an extended 
discussion by Scheffe (1956), and an English trans- 
lation by Dabrowska and Speed (1990). The model 
was covered in elementary textbooks in the 1960s; 
see, for instance, Hodges and Lehmann (1964, Sec- 
tion 9.4). The setup is often called "Rubin's model," 
due in part to Holland (1986); that mistakes the his- 
tory. 

Neyman, Kolodziejczyk, and Iwaszkiewicz (1935) 
develop models with subject-specific random effects 
that depend on assignment, the objective being to 
estimate average expected values under various cir- 
cumstances. This is discussed in Section 4 of Scheffe 
(1956). 

Heckman (2000) explains the role of potential outcomes in econometrics.
In epidemiology, a good source is Robins (1999). Rosenbaum (2002)
proposes using models and permutation tests as devices for
hypothesis-testing. This avoids difficulties outlined here: (i) if
treatment has no effect, then $Y_i^T = Y_i^C = Y_i$ for all i, and
(ii) randomization makes all permutations of i equally likely, which is
just what permutation tests need.

Rosenblum and van der Laan (2008) suggest that, 
at least for purposes of hypothesis testing, robust 
SEs will fix problems created by specification error. 
Such optimism is unwarranted. Under the alterna- 
tive hypothesis, the robust SE is unsatisfactory be- 
cause it ignores bias (Freedman, 2006b). 

Under the null hypothesis, the robust SE may 
be asymptotically correct, but using it can reduce 
power (Freedman, 2008a, 2008b). In any event, if 
the null hypothesis is to be tested using model-based 
adjustments, exact P-values can be computed by 
permutation methods, as suggested by Rosenbaum 
(2002). 

Models are often deployed to infer causation from 
association. For a discussion from various perspec- 
tives, see Berk (2004), Brady and Collier (2004), 
and Freedman (2005). The last summarizes a cross- 
section of the literature on this topic (pages 192- 
200). 

Consider a logit model like the one in Section 3. Omitting the
covariate Z from the equation is called marginalizing over Z. The model
is collapsible if the marginal model is again logit with the same
$\beta_2$. In other words, given the X's, the Y's are conditionally
independent, and

$P(Y_i = 1 \mid X) = \frac{\exp(\alpha + \beta_2 X_i)}{1 + \exp(\alpha + \beta_2 X_i)}$.

Guo and Geng (1995) give conditions for collapsi- 
bility; also see Ducharme and Lepage (1986). Gail 
(1986, 1988) discusses collapsing when a design is 
balanced. Robinson and Jewell (1991) show that col- 
lapsing will usually decrease variance: logit models 
differ from linear models. Aris et al. (2000) review 
the literature and consider modeling strategies to 
compensate for non-collapsibility. 

8. SKETCH OF PROOFS 

We are fitting the logit model, which is incorrect, to data from an
experiment. As before, let $X_i$ be the assignment variable, so
$X_i = 1$ if $i \in T$ and $X_i = 0$ if $i \in C$. Let $Y_i$ be the
observed response, so $Y_i = X_i Y_i^T + (1 - X_i) Y_i^C$. Let
$L_n(\beta)$ be the "log-likelihood function" to be maximized. The
quote marks are there because the model is wrong; $L_n$ is therefore
only a pseudo-log-likelihood function. Abbreviate $p_i(\beta)$ for
$p(\beta, X_i, Z_i)$ in (4). The formula for $L_n(\beta)$ is this:

(12a)  $L_n(\beta) = \sum_{i=1}^{n} T_i$,

where

(12b)  $T_i = \log[1 - p_i(\beta)] + (\beta_1 + \beta_2 X_i + \beta_3 Z_i) Y_i$.

(The T is for term, not treatment.) It takes a moment to verify (12),
starting from the equation

(13)  $T_i = Y_i \log(p_i) + (1 - Y_i) \log(1 - p_i)$.

Each $T_i$ is negative. The function $\beta \to L_n(\beta)$ is strictly
concave, as one sees by proving that $L_n''$ is a negative definite
matrix. Consequently, there is a unique maximum at the MLE
$\hat\beta_n$. We write $\hat\beta_n$ to show dependence on the size n
of the study population, although that creates a conflict in the
notation. If pressed, we could write $\hat\beta_{n,j}$ for the jth
component of the MLE.

The ith row of the "design matrix" is $(1, X_i, Z_i)$. Tacitly, we are
assuming this matrix is nonsingular. For large n, the assumption will
follow from regularity conditions to be imposed. The concavity of $L_n$
is well known. See, for instance, pages 122-123 in Freedman (2005) or
page 273 in Amemiya (1985). Pratt (1981) discusses the history and
proves a more general result.

For reference, we record one variation on these ideas. Let M be an
n × p matrix of rank p; write $M_i$ for the ith row of M. Let y be an
n × 1 vector of 0s and 1s. Let $\beta$ be a p × 1 vector. Let $w_i > 0$
for i = 1, ..., n. Consider M and y as fixed, $\beta$ as variable.
Define $L(\beta)$ as

$\sum_{i=1}^{n} w_i \{ -\log[1 + \exp(M_i \cdot \beta)] + (M_i \cdot \beta) y_i \}$.

Proposition 1. The function $\beta \to L(\beta)$ is strictly concave.

One objective in the rest of this section is showing that

(14)  $\hat\beta_n$ converges to a limit $\beta_\infty$ as $n \to \infty$.

A second objective is showing that

(15)  the plug-in estimator $\tilde\Delta$ is consistent.

The argument actually shows a little more. The plug-in estimator
$\tilde\alpha^T$, the ITT estimator $\hat\alpha^T$, and the parameter
$\alpha^T$ become indistinguishable as the size n of the study
population grows; likewise for $\tilde\alpha^C$, $\hat\alpha^C$ and
$\alpha^C$.

The ITT estimators $\hat\alpha^T$, $\hat\alpha^C$ were defined in (11).
Recall too that $n_T = n\pi_T$ and $n_C = n\pi_C$ are the numbers of
subjects in T and C respectively. The statement of Lemma 1 involves the
empirical distribution of $Z_i$ for $i \in T$, which assigns mass
$1/n_T$ to $Z_i$ for each $i \in T$. Similarly, the empirical
distribution of $Z_i$ for $i \in C$ assigns mass $1/n_C$ to $Z_i$ for
each $i \in C$.

To prove Lemma 1, we need the likelihood equation $L_n'(\beta) = 0$.
This vector equation unpacks to three scalar equations in three
unknowns, the components of $\beta$ that make up $\hat\beta_n$:

(16)  $\frac{1}{n_T}\sum_{i \in T} p(\hat\beta_n, 1, Z_i) = \frac{1}{n_T}\sum_{i \in T} Y_i$,

(17)  $\frac{1}{n_C}\sum_{i \in C} p(\hat\beta_n, 0, Z_i) = \frac{1}{n_C}\sum_{i \in C} Y_i$,

(18)  $\frac{1}{n}\sum_{i=1}^{n} p(\hat\beta_n, X_i, Z_i) Z_i = \frac{1}{n}\sum_{i=1}^{n} Y_i Z_i$.

This follows from (12)-(13) after differentiating with respect to
$\beta_1$, $\beta_2$, and $\beta_3$, and then doing a bit of algebra.




Lemma 1. If the empirical distribution of $Z_i$ for $i \in T$ matches
the empirical distribution for $i \in C$ (the first balance condition),
then the plug-in estimators $\tilde\alpha^T$, $\tilde\alpha^C$ match
the ITT estimators. More explicitly,

$\frac{1}{n}\sum_{i=1}^{n} p(\hat\beta_n, 1, Z_i) = \frac{1}{n_T}\sum_{i \in T} Y_i$,

$\frac{1}{n}\sum_{i=1}^{n} p(\hat\beta_n, 0, Z_i) = \frac{1}{n_C}\sum_{i \in C} Y_i$.

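Proof. The plug-in estimators $\tilde\alpha^T$, $\tilde\alpha^C$ were
defined in (10); the ITT estimators $\hat\alpha^T$, $\hat\alpha^C$, in
(11). We begin with $\tilde\alpha^T$. By (16),

$\frac{1}{n_T}\sum_{i \in T} p(\hat\beta_n, 1, Z_i) = \frac{1}{n_T}\sum_{i \in T} Y_i = \hat\alpha^T$.

By the balance condition,

$\frac{1}{n_C}\sum_{i \in C} p(\hat\beta_n, 1, Z_i) = \frac{1}{n_T}\sum_{i \in T} p(\hat\beta_n, 1, Z_i)$

equals $\hat\alpha^T$ too. Finally, the average of
$p(\hat\beta_n, 1, Z_i)$ over all i is a mixture of the averages over T
and C. So $\tilde\alpha^T = \hat\alpha^T$ as required. The same
argument works for $\tilde\alpha^C$, using (17). □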

For the next lemma, recall $\alpha^T$, $\alpha^C$ from (1). The easy
proof is omitted, being very similar to the proof of the previous
result.

Lemma 2. Suppose the empirical distribution of the pairs
$(Y_i^C, Y_i^T)$ for $i \in T$ matches the empirical distribution for
$i \in C$ (the second balance condition). Then
$\hat\alpha^T = \alpha^T$ and $\hat\alpha^C = \alpha^C$.

Lemma 3. Let x be any real number. Then

$e^x - \tfrac{1}{2} e^{2x} < \log(1 + e^x) < e^x$,

$x + e^{-x} - \tfrac{1}{2} e^{-2x} < \log(1 + e^x) < x + e^{-x}$.

The first bound is useful when x is large and negative; the second,
when x is large and positive. To get the second bound from the first,
write $1 + e^x = e^x (1 + e^{-x})$, then replace x by -x. The first
bound will look more familiar on substituting $y = e^x$. The proof is
omitted, being "just" calculus.
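
Although the proof is calculus, the two bounds in Lemma 3 are easy to spot-check numerically. The function name and the grid of test points below are assumptions made for the example; a check at a handful of points is of course no substitute for the proof.

```python
import math

def lemma3_holds(x):
    """Check both pairs of inequalities in Lemma 3 at a single point x."""
    mid = math.log1p(math.exp(x))  # log(1 + e^x)
    first = math.exp(x) - 0.5 * math.exp(2 * x) < mid < math.exp(x)
    second = x + math.exp(-x) - 0.5 * math.exp(-2 * x) < mid < x + math.exp(-x)
    return first and second

checks = [lemma3_holds(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```

At x = -5 the first pair is tight and the second is loose; at x = 5 the roles reverse, which matches the remark about when each bound is useful.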

For the next result, let G be an open, bounded, convex subset of
Euclidean space. Let $f_n$ be a strictly concave function on G,
converging uniformly to $f_\infty$, which is also strictly concave. Let
$f_n$ take its maximum at $x_n$, while $f_\infty$ takes its maximum at
$x_\infty \in G$. Although the lemma is well known, a proof may be
helpful. We write $G \setminus H$ for the set of points that are in G
but not in H.

Lemma 4. $x_n \to x_\infty$ and $f_n(x_n) \to f_\infty(x_\infty)$.

Proof. Choose a small neighborhood H of $x_\infty = \arg\max f_\infty$.
There is a small positive $\delta$ with
$f_\infty(x) < f_\infty(x_\infty) - \delta$ for $x \in G \setminus H$.
For all sufficiently large n, we have $|f_n - f_\infty| < \delta/3$. In
particular, $f_n(x_\infty) > f_\infty(x_\infty) - \delta/3$. On the
other hand, if $x \in G \setminus H$, then

$f_n(x) < f_\infty(x) + \delta/3 < f_\infty(x_\infty) - 2\delta/3$.

Thus, $\arg\max f_n \in H$. Furthermore,
$f_n(x_n) \ge f_n(x_\infty) > f_\infty(x_\infty) - \delta/3$. In the
other direction,
$f_\infty(x_\infty) \ge f_\infty(x_n) > f_n(x_n) - \delta/3$. So

$|\max f_n - \max f_\infty| < \delta/3$,

which completes the proof. □

For the final lemma, consider a population consisting of n objects.
Suppose r are red, and $r/n \to \rho$ with $0 < \rho < 1$. (The
remaining n - r objects are colored black.) Now choose m out of the n
objects at random without replacement, where $m/n \to \lambda$ with
$0 < \lambda < 1$. Let $X_m$ be the number of red objects that are
chosen. So $X_m$ is hypergeometric. The lemma puts no conditions on the
joint distribution of the $\{X_m\}$. Only the marginals are relevant.

Lemma 5. $X_m/n \to \lambda\rho$ almost surely as $n \to \infty$.
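
Before the proof, a quick simulation may help fix ideas about what Lemma 5 asserts. The sketch draws without replacement from populations of increasing size and records how far $X_m/n$ falls from $\lambda\rho$. The values 0.4 and 0.75 for the limits, the population sizes, the seed, and the function name are assumptions chosen for illustration.

```python
import random

def red_fraction(n, rho, lam, rng):
    """Draw m of n objects without replacement; return X_m / n."""
    r = round(rho * n)   # number of red objects
    m = round(lam * n)   # number of draws
    population = [1] * r + [0] * (n - r)
    x_m = sum(rng.sample(population, m))  # hypergeometric count of reds
    return x_m / n

rng = random.Random(1)
rho, lam = 0.4, 0.75  # hypothetical limits of r/n and m/n
errors = {n: abs(red_fraction(n, rho, lam, rng) - lam * rho)
          for n in (100, 10_000, 100_000)}
```

The entries of `errors` shrink roughly like one over the square root of n, in line with the moment bound used in the proof.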

Proof. Of course, E(X m ) = rm/n. The lemma 
can be proved by using Chebychev's inequality, after 
showing that 



E 



X m ~ r— 

n 



mV 



0(n 2 



Tedious algebra can be reduced by appealing to The- 
orem 4 in Hoeffding (1963). In more detail, let Wj 
be independent 0-1 variables with P{Wi = 1) = r/n. 
Thus, Y^Li Wi is the number of reds in m draws 
with replacement, while X m is the number of reds in 
m draws without replacement. According to Hoeff- 
ding's theorem, X m is more concentrated around the 
common expected value. In particular, 

4i 



n 



<E< 



n 



Expanding [X)^i(^ /7 « — n)] 4 y^ds m terms of the 
form (Wi — -) 4 . Each of these terms is bounded 
above by 1. Next consider terms like 
O^i — ^) 2 (Wj — ^) 2 with i ^ j. The number of such 
terms is of order m 2 , and each term is bounded 
above by 1. All remaining terms have expectation 



Note. There are m 4 terms in (oi H h a m ) 4 = 

J2ijki a i a j a k a i- By combinatorial arguments: 

(i) m terms are like etj , with one index only. 

(ii) 3m(m — 1) are like ai 2 aj 2 , with two different 
indices. 

(hi) 4m(m — 1) are like Oj 3 aj, with two different 
indices. 

(iv) 6m(m — 1) (m — 2) are like a^a^a^ , with three 
different indices. 

(v) m(m — l)(m — 2)(m — 3) are like aia^a^ai^ 
with four different indices. 

The counts can also be derived from the "multino- 
mial theorem," which expands (a\-\ \-a m ) N . For 

an early — and very clear — textbook exposition, see 
Chrystal (1889, pages 14-15). A little care is needed, 
since our counts do not restrict the order of the in- 
dices: i < j and i > j are both allowed. By contrast, 
in the usual statements of the multinomial theo- 
rem, indices are ordered (i < j). German scholarship 
traces the theorem ("der polynomische Lehrsatz") 
back to correspondence between Leibniz and Johann 
Bernoulli in 1695; see, for instance, Tauber (1963), 
Netto (1927, page 58), and Tropfke (1903, page 332). 
On the other hand, de Moivre (1697) surely deserves 
some credit. 

We return now to our main objectives. In out- 
line, we must show that L n ((3)/n converges to a 
limit L 00 (f3), uniformly over f3 in any bounded set; 
this will follow from Lemma 5. The limiting L 00 (/?) 
is a strictly concave function of (3, with a unique 
maximum at Pqq: see Proposition 1. Furthermore, 
Pn —> Poo by Lemma 4. In principle, randomization 
ensures that the balance conditions are nearly satis- 
fied, so the plug-in estimator is consistent by Lem- 
mas 1-2. A rigorous argument gets somewhat intri- 
cate; one difficulty is showing that remote /?'s can 
be ignored, and Lemma 3 helps in this respect. 

Some regularity conditions are needed. Technicalities will be
minimized if we assume that $Z_i$ takes only a finite number of
values; notational overhead is reduced even further if $Z_i = 0$, 1,
or 2. There are now $3 \times 2 \times 2 = 12$ possible values for
the triples $Z_i, Y_i^C, Y_i^T$. We say that subject $i$ is of type
$zct$ provided

$Z_i = z$, $Y_i^C = c$, $Y_i^T = t$.



0. Thus, E[(X n 



) ] is of order m 2 < n 2 



□ 



Let $\theta_{z,c,t}$ be the fraction of subjects that are of type
$zct$; the number of these subjects is $n\theta_{z,c,t}$.

The $\theta$'s are population-level parameters. They are not random.
They sum to 1. We assume the $\theta$'s are all positive. Recall that
$\pi_T$ is the fraction of subjects assigned to T. This is fixed (not
random), and $0 < \pi_T < 1$. The fraction assigned to C is
$\pi_C = 1 - \pi_T$. In principle, $\pi_T$, $\pi_C$, and the
$\theta_{z,c,t}$ depend on $n$. As $n$ increases, we assume these
quantities have respective limits $\lambda_T$, $\lambda_C$, and
$\lambda_{z,c,t}$, all positive. Since $z$ takes only finitely many
values, $\sum_{z,c,t} \lambda_{z,c,t} = 1$.

When $n$ is large, within type $zct$, the fraction of subjects
assigned to T is random, but essentially $\lambda_T$: such subjects
necessarily have response $Y_i = t$. Likewise, the fraction assigned
to C is random, but essentially $\lambda_C$: such subjects necessarily
have response $Y_i = c$. In the limit, the $Z$'s are exactly balanced
between T and C within each type of subject. That is the essence of
the argument; details follow.

Within type $zct$, let $n^T_{z,c,t}$ and $n^C_{z,c,t}$ be the number
of subjects assigned to T and C, respectively. So

$n^T_{z,c,t} + n^C_{z,c,t} = n\theta_{z,c,t}$.

The variables $n^T_{z,c,t}$ are hypergeometric. They are
unobservable. This is because type is unobservable: $Y_i^C$ and
$Y_i^T$ are not simultaneously observable.

To analyze the log-likelihood function $L_n(\beta)$, recall that
$Y_i = X_i Y_i^T + (1 - X_i) Y_i^C$ is the observed response. Let
$n_{z,x,y}$ be the number of $i$ with $Z_i = z$, $X_i = x$,
$Y_i = y$; here $z = 0$, 1, or 2, $x = 0$ or 1, and $y = 0$ or 1. The
$n_{z,x,y}$ are observable because $Y_i$ is observable. They are
random because $X_i$ is random. Also let
$n_{z,x} = n_{z,x,0} + n_{z,x,1}$, which is the number of subjects
$i$ with $Z_i = z$ and $X_i = x$. Now $L_n(\beta)/n$ in (12) is the
sum

(19a)  $L_n(\beta)/n = \sum_{z,x} T_{z,x}$,

where

(19b)  $T_{z,x} = -\frac{n_{z,x}}{n} \log[1 + \exp(\beta_1 + \beta_2 x + \beta_3 z)]
         + \frac{n_{z,x,1}}{n} (\beta_1 + \beta_2 x + \beta_3 z)$.

(Again, $T$ is for "term," not "treatment.") This can be checked by
grouping the terms $T_i$ in (12) according to the possible values of
$(Z_i, X_i, Y_i)$. There are six terms $T_{z,x}$ in (19),
corresponding to $z = 0$, 1, or 2 and $x = 0$ or 1.
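The grouping step can be illustrated numerically. The sketch below assumes that the per-subject terms in (12) have the usual logit form $Y_i \eta_i - \log[1 + \exp(\eta_i)]$ with $\eta_i = \beta_1 + \beta_2 X_i + \beta_3 Z_i$ (equation (12) is not reproduced in this section, so that form is an assumption). The data and the value of `beta` are arbitrary illustrations.

```python
import math
import random
from collections import Counter

random.seed(1)
n = 500
# Hypothetical observed data: one (z, x, y) triple per subject.
data = [(random.randrange(3), random.randrange(2), random.randrange(2))
        for _ in range(n)]
beta = (-0.3, 0.8, 0.25)  # arbitrary parameter value


def eta(x, z):
    """Linear predictor beta_1 + beta_2 x + beta_3 z."""
    return beta[0] + beta[1] * x + beta[2] * z


# Subject-by-subject average log-likelihood (assumed form of (12), divided by n).
per_subject = sum(
    y * eta(x, z) - math.log1p(math.exp(eta(x, z))) for z, x, y in data
) / n

# Grouped form: six terms indexed by (z, x), built from the cell counts.
n_zxy = Counter(data)
grouped = 0.0
for z in range(3):
    for x in range(2):
        n_zx = n_zxy[z, x, 0] + n_zxy[z, x, 1]
        grouped += (-(n_zx / n) * math.log1p(math.exp(eta(x, z)))
                    + (n_zxy[z, x, 1] / n) * eta(x, z))

assert math.isclose(per_subject, grouped, abs_tol=1e-9)
```

The two sums agree up to floating-point rounding because subjects in the same $(z, x)$ cell share the same linear predictor.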
We claim

(20)  $n_{z,x,y} = n^T_{z,0,y} + n^T_{z,1,y}$  if $x = 1$,
      $n_{z,x,y} = n^C_{z,y,0} + n^C_{z,y,1}$  if $x = 0$.

The trick is seeing through the notation. For instance, take $x = 1$.
By definition, $n_{z,1,y}$ is the number of $i$ with $Z_i = z$,
$X_i = 1$, $Y_i = y$. The $i$'s with $X_i = 1$ correspond to subjects
in the treatment group, so $Y_i = Y_i^T$. Thus, $n_{z,1,y}$ is the
number of $i$ with $Z_i = z$, $X_i = 1$, $Y_i^T = y$. Also by
definition, $n^T_{z,c,y}$ is the number of subjects with $Z_i = z$,
$X_i = 1$, $Y_i^C = c$, $Y_i^T = y$. Now add the numbers for
$c = 0, 1$: how these subjects would have responded to the control
regime is at this point irrelevant. A similar argument works if
$x = 0$, completing the discussion of (20).
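Identity (20) can be checked on simulated data. The following is a minimal sketch under stated assumptions: the population, its size, and the random assignment are all hypothetical, and the types $(z, c, t)$ are visible to the simulation even though they are unobservable in a real experiment.

```python
import random
from collections import Counter
from itertools import product

random.seed(0)
n, m = 600, 240  # hypothetical population size and treatment-group size
# Each subject has a type (z, c, t): covariate, response if control, response if treated.
subjects = [(random.randrange(3), random.randrange(2), random.randrange(2))
            for _ in range(n)]
treated = set(random.sample(range(n), m))  # m subjects drawn without replacement

n_T, n_C, n_obs = Counter(), Counter(), Counter()
for i, (z, c, t) in enumerate(subjects):
    x = 1 if i in treated else 0
    y = t if x == 1 else c          # observed response
    if x == 1:
        n_T[z, c, t] += 1           # type counts within the treatment group
    else:
        n_C[z, c, t] += 1           # type counts within the control group
    n_obs[z, x, y] += 1             # observable counts


def rhs_20(z, x, y):
    """Right-hand side of (20): sum over the unobserved potential response."""
    if x == 1:
        return n_T[z, 0, y] + n_T[z, 1, y]
    return n_C[z, y, 0] + n_C[z, y, 1]


identity_holds = all(
    n_obs[z, x, y] == rhs_20(z, x, y)
    for z, x, y in product(range(3), range(2), range(2))
)
assert identity_holds
```

The identity is exact for every assignment, not just on average, since it only regroups the same subjects.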



Recall that $\theta_{z,c,y} \to \lambda_{z,c,y}$ as $n \to \infty$. Let

$\theta_z = \sum_{c,y} \theta_{z,c,y}$  and  $\lambda_z = \sum_{c,y} \lambda_{z,c,y}$.

Thus, $\theta_z$ is the fraction of subjects with $Z_i = z$, and
$\theta_z \to \lambda_z$ as $n \to \infty$.
As $n \to \infty$, we claim that

(21)  $n_{z,1,y}/n \to \lambda_T(\lambda_{z,0,y} + \lambda_{z,1,y})$,

(22)  $n_{z,1}/n \to \lambda_T \lambda_z$,

(23)  $n_{z,0,y}/n \to \lambda_C(\lambda_{z,y,0} + \lambda_{z,y,1})$,

(24)  $n_{z,0}/n \to \lambda_C \lambda_z$,

where, for instance, $\lambda_T$ is the limit of $\pi_T$ as
$n \to \infty$. More specifically, there is a set $\mathcal{N}$ of
probability 0, and (21)-(24) hold true outside of $\mathcal{N}$.
Indeed, (21) follows from (20) and Lemma 5. Then (22) follows from
(21) by addition over $y = 0, 1$. The last two lines are similar to
the first two.

A little more detail on (21) may be helpful. What is the connection
with Lemma 5? Consider $n^T_{z,0,y}$, which is the number of subjects
of type $z0y$ that are assigned to T. The "reds" are subjects of type
$z0y$, so the fraction of reds in the population converges to
$\lambda_{z,0,y}$, by assumption. We are drawing $m$ times at random
without replacement from the population to get the treatment group,
and $m/n \to \lambda_T$, also by assumption. Now $X_m$ is the number
of reds in the sample, that is, the number of subjects of type $z0y$
assigned to treatment. The lemma tells us that
$X_m/n \to \lambda_T \lambda_{z,0,y}$ almost surely. The same argument
works for $n^T_{z,1,y}$. Add to get (21).
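The limits (21) and (23) can be illustrated by simulation. This is a sketch under assumptions, not a proof: the type weights in `RAW`, the value `pi_T = 0.4`, and the population size are hypothetical, and a single random assignment stands in for the almost-sure statement.

```python
import random
from collections import Counter
from itertools import product

random.seed(2)
# Hypothetical type weights, proportional to lambda_{z,c,t}; keys are (z, c, t).
RAW = {
    (0, 0, 0): 3, (0, 0, 1): 2, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 2, (1, 0, 1): 2, (1, 1, 0): 1, (1, 1, 1): 2,
    (2, 0, 0): 1, (2, 0, 1): 2, (2, 1, 0): 1, (2, 1, 1): 3,
}
K = 2000
population = [typ for typ, w in RAW.items() for _ in range(w * K)]
n = len(population)
lam = {typ: w * K / n for typ, w in RAW.items()}
pi_T = 0.4
m = round(pi_T * n)
treated = set(random.sample(range(n), m))  # sampling without replacement

n_zxy = Counter()
for i, (z, c, t) in enumerate(population):
    x = 1 if i in treated else 0
    n_zxy[z, x, t if x == 1 else c] += 1   # observed response is t or c


def limit(z, x, y):
    """Right-hand sides of (21) and (23), with lambda_T = pi_T."""
    if x == 1:
        return pi_T * (lam[z, 0, y] + lam[z, 1, y])
    return (1 - pi_T) * (lam[z, y, 0] + lam[z, y, 1])


max_gap = max(
    abs(n_zxy[z, x, y] / n - limit(z, x, y))
    for z, x, y in product(range(3), range(2), range(2))
)
print(f"largest deviation from the limits: {max_gap:.5f}")
```

With this population size the observed fractions should sit within a fraction of a percentage point of the limiting values; the gap shrinks as `K` grows.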

Next, fix a positive, finite, real number $B$. Consider the open,
bounded, convex polyhedron $G_B$ defined by the six inequalities

(25)  $|\beta_1 + \beta_2 x + \beta_3 z| < B$

for $x = 0, 1$ and $z = 0, 1, 2$. As $n \to \infty$, we claim that
$L_n(\beta)/n \to L_\infty(\beta)$ uniformly over $\beta \in G_B$,
where

(26a)  $L_\infty(\beta) = \lambda_T \Lambda_T + \lambda_C \Lambda_C$,

(26b)  $\Lambda_T = \sum_z \bigl( -\lambda_z \log[1 + \exp(\phi_T(z))]
         + (\lambda_{z,0,1} + \lambda_{z,1,1}) \phi_T(z) \bigr)$,

(26c)  $\Lambda_C = \sum_z \bigl( -\lambda_z \log[1 + \exp(\phi_C(z))]
         + (\lambda_{z,1,0} + \lambda_{z,1,1}) \phi_C(z) \bigr)$,

(26d)  $\phi_T(z) = \beta_1 + \beta_2 + \beta_3 z$,  $\phi_C(z) = \beta_1 + \beta_3 z$.

(Recall that $\lambda_T$ was the limit of $\pi_T$ as $n \to \infty$,
and likewise for $\lambda_C$.) This follows from (21)-(24), on
splitting the sum in (19) into two sums, one with terms $z1$ and the
other with terms $z0$. The $z1$ terms give us $\lambda_T \Lambda_T$,
and the $z0$ terms give us $\lambda_C \Lambda_C$. The conclusion holds
outside the null set $\mathcal{N}$ defined for (21)-(24).

It may be useful to express the limiting distribution of
$\{Z, X, Y\}$ in terms of $\lambda_T$, $\lambda_C$ and
$\lambda_{z,c,t}$, the latter being the limiting fraction of subjects
of type $zct$. See Table 2. For example, what fraction of subjects
have $Z = z$, $X = 1$, $Y = 1$ in the limit? The answer is the first
row, second column of the table. The other entries can be read in a
similar way.

Table 2
Asymptotic distribution of $\{Z, X, Y\}$ expressed in terms of
$\lambda_T$, $\lambda_C$ and $\lambda_{z,c,t}$

Value    Weight
$z11$    $\lambda_T(\lambda_{z,0,1} + \lambda_{z,1,1})$
$z10$    $\lambda_T(\lambda_{z,0,0} + \lambda_{z,1,0})$
$z01$    $\lambda_C(\lambda_{z,1,0} + \lambda_{z,1,1})$
$z00$    $\lambda_C(\lambda_{z,0,0} + \lambda_{z,0,1})$

The function $\beta \to L_\infty(\beta)$ is strictly concave, by
Proposition 1 with $n = 12$ and $p = 3$. The rows of (M y) run
through all 12 combinations of $zxy$ with $z = 0, 1, 2$, and
$x = 0, 1$, and $y = 0, 1$. The weights are shown in Table 2.

Let $\beta_\infty$ be the $\beta$ that maximizes $L_\infty(\beta)$.
Choose $B$ in (25) so large that $\beta_\infty \in G_B$. Lemma 4
shows that $\max_{\beta \in G_B} L_n(\beta)/n$ is close to
$L_\infty(\beta_\infty)$ for all large $n$. Outside $G_B$, if $B$ is
large enough, $L_n(\beta)/n$ is too small to matter; additional detail
is given below. Thus, $\hat\beta_n \in G_B$ for all large $n$, and
$\hat\beta_n$ converges to $\beta_\infty$.

This completes the argument for (14) and we turn to proving (15), the
consistency of the plug-in estimators defined by (10). Recall that
$\theta_z$ is the fraction of $i$'s with $Z_i = z$; and
$\theta_z \to \lambda_z$ as $n \to \infty$. Now

$\tilde\alpha^T = \frac{1}{n} \sum_{i=1}^n p(\hat\beta_n, 1, Z_i)
  = \sum_z \theta_z\, p(\hat\beta_n, 1, z)
  \to \sum_z \lambda_z\, p(\beta_\infty, 1, z)$,

where the function $p(\beta, x, z)$ was defined in (4). Remember, $z$
takes only finitely many values! A similar argument shows that
$\tilde\alpha^C \to \sum_z \lambda_z\, p(\beta_\infty, 0, z)$.

The limiting distribution for $\{Z_i, Y_i^C, Y_i^T\}$ is defined by
the $\lambda_{z,c,t}$, where $\lambda_{z,c,t}$ is the limiting
fraction of subjects of type $zct$; recall that
$\lambda_z = \sum_{c,t} \lambda_{z,c,t}$. We claim

(27)  $\sum_z \lambda_z\, p(\beta_\infty, 1, z) = \sum_{z,c} \lambda_{z,c,1}$,

(28)  $\sum_z \lambda_z\, p(\beta_\infty, 0, z) = \sum_{z,t} \lambda_{z,1,t}$.

Indeed, (22) and (24) show that in the limit, the $Z_i$ are exactly
balanced between T and C. Likewise, (21) and (23) show that in the
limit, the pairs $Y_i^C, Y_i^T$ are exactly balanced between T and C.
Apply Lemmas 1-2. The left-hand side of (27) is the plug-in estimator
for the limiting $\alpha^T$. The right-hand side is the ITT estimator,
as well as truth. The three values coincide by the lemmas. The
argument for (28) is the same, completing the discussion of (27)-(28).
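Claims (27)-(28) can be checked numerically for a particular limiting distribution. The sketch below is an illustration under assumptions: the type weights in `RAW` and the values of `LAM_T` and `LAM_C` are hypothetical, $p(\beta, x, z)$ is taken to be the logistic function of $\beta_1 + \beta_2 x + \beta_3 z$, and the maximizer of the limiting log-likelihood is found by plain Newton iteration with a small hand-written linear solver (`solve` and `maximize` are illustrative helpers).

```python
import math
from itertools import product

RAW = {  # hypothetical weights proportional to lambda_{z,c,t}; keys are (z, c, t)
    (0, 0, 0): 3, (0, 0, 1): 2, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 2, (1, 0, 1): 2, (1, 1, 0): 1, (1, 1, 1): 2,
    (2, 0, 0): 1, (2, 0, 1): 2, (2, 1, 0): 1, (2, 1, 1): 3,
}
TOTAL = sum(RAW.values())
LAM = {k: v / TOTAL for k, v in RAW.items()}
LAM_T, LAM_C = 0.4, 0.6  # hypothetical limiting assignment fractions


def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for j in range(col, k + 1):
                aug[r][j] -= f * aug[col][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        s = aug[r][k] - sum(aug[r][j] * x[j] for j in range(r + 1, k))
        x[r] = s / aug[r][r]
    return x


def weight(z, x, y):
    """Limiting weight of the cell (Z, X, Y) = (z, x, y), as in Table 2."""
    if x == 1:
        return LAM_T * (LAM[z, 0, y] + LAM[z, 1, y])
    return LAM_C * (LAM[z, y, 0] + LAM[z, y, 1])


def p(beta, x, z):
    """Assumed logit success probability for treatment status x and covariate z."""
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x + beta[2] * z)))


def maximize(steps=25):
    """Newton iteration for the maximizer of the limiting log-likelihood."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for z, x, y in product(range(3), range(2), range(2)):
            v = (1.0, float(x), float(z))
            w, prob = weight(z, x, y), p(beta, x, z)
            for a in range(3):
                grad[a] += w * (y - prob) * v[a]
                for b in range(3):
                    info[a][b] += w * prob * (1 - prob) * v[a] * v[b]
        beta = [bi + si for bi, si in zip(beta, solve(info, grad))]
    return beta


beta_inf = maximize()
lam_z = {z: sum(LAM[z, c, t] for c in range(2) for t in range(2)) for z in range(3)}
lhs27 = sum(lam_z[z] * p(beta_inf, 1, z) for z in range(3))
rhs27 = sum(LAM[z, c, 1] for z in range(3) for c in range(2))
lhs28 = sum(lam_z[z] * p(beta_inf, 0, z) for z in range(3))
rhs28 = sum(LAM[z, 1, t] for z in range(3) for t in range(2))
assert abs(lhs27 - rhs27) < 1e-9 and abs(lhs28 - rhs28) < 1e-9
```

In this example the agreement comes from the score equations at the maximizer: with the covariate balanced across arms, setting the derivatives with respect to $\beta_1$ and $\beta_2$ to zero forces the average fitted probability in each arm to match the average potential response.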

The right-hand side of (27) can be recognized as the limit of
$\frac{1}{n} \sum_{i=1}^n Y_i^T = \sum_{z,c} \theta_{z,c,1}$;
likewise, the right-hand side of (28) is the limit of
$\frac{1}{n} \sum_{i=1}^n Y_i^C$. This completes the proof of (15). In
effect, the argument parlays Fisher consistency into almost-sure
consistency, the exceptional null set being the set $\mathcal{N}$
where (21)-(24) fail.

Our results give an indirect characterization of $\lim \hat\beta_n$
as the $\beta$ at which the limiting log-likelihood function (26)
takes on its maximum. Furthermore, asymptotic normality of
$\{n^T_{z,c,t}\}$ entails asymptotic normality of $\hat\beta_n$ and
the plug-in estimators, but that is a topic for another day.

Additional Detail on Boundedness 

Consider a $z1$ term in (19). We are going to show that for $B$
large, this term is too small to matter. Fix a small positive
$\epsilon$. By (22), for all large $n$,

$n_{z,1}/n > (1 - \epsilon) \lambda_T \lambda_z$;

by (21),

$n_{z,1,1}/n < (1 + \epsilon) \lambda_T (\lambda_{z,0,1} + \lambda_{z,1,1})$.

Let $z' = \beta_1 + \beta_2 + \beta_3 z > B > 0$. By Lemma 3,

$\log[1 + \exp(z')] > z' + \exp(-z') - \frac{1}{2} \exp(-2z') > z'$

because $z' > B > 0$. Our $z1$ term is therefore bounded above for all
large $n$ by

$[-(1 - \epsilon) \lambda_z + (1 + \epsilon)(\lambda_{z,0,1} + \lambda_{z,1,1})] \lambda_T z'$.

The largeness needed in $n$ depends on $\epsilon$ not $B$. We can
choose $\epsilon > 0$ so small that

$(1 + \epsilon)(\lambda_{z,0,1} + \lambda_{z,1,1}) < (1 - 2\epsilon) \lambda_z$,

because $\lambda_{z,0,1} + \lambda_{z,1,1} < \lambda_z$. Our term is
therefore bounded above by $-\epsilon \lambda_T \lambda_z B$. For $B$
large enough, this term is so negative as to be irrelevant. The
argument works because all $\lambda_{z,c,t}$ are assumed positive, and
there are only finitely many of them. A similar argument works for
$z' = \beta_1 + \beta_2 + \beta_3 z < -B$, and for terms $z0$ in
(19). These arguments go through outside the null set $\mathcal{N}$
defined for (21)-(24).
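The chain of inequalities used for the boundedness argument can be spot-checked numerically. The sketch below only tests the inequality $\log[1 + e^{x}] > x + e^{-x} - \frac{1}{2} e^{-2x} > x$ on a grid of positive $x$; the grid and the helper name `lemma3_chain` are illustrative, and a grid check is of course not a proof.

```python
import math


def lemma3_chain(x):
    """Check log(1 + e^x) > x + e^{-x} - e^{-2x}/2 > x at a single point x > 0."""
    lower = x + math.exp(-x) - 0.5 * math.exp(-2 * x)
    return math.log1p(math.exp(x)) > lower > x


# Grid 0.1, 0.2, ..., 5.0; larger x would push the gaps below float precision.
checks = [lemma3_chain(0.1 * k) for k in range(1, 51)]
assert all(checks)
```

The first inequality reflects $\log(1 + u) > u - u^2/2$ for $u = e^{-x} > 0$, after writing $\log(1 + e^{x}) = x + \log(1 + e^{-x})$.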

Summing up 

It may be useful to summarize the results so far. The parameter
$\alpha^T$ is defined in terms of the study population, as the
fraction of successes that would be obtained if all members of the
population were assigned to treatment; likewise for $\alpha^C$. See
(1). The differential log odds $\Delta$ of success is defined by (2).
There is a covariate taking a finite number of values. A fraction of
the subjects are assigned at random to treatment, and the rest to
control. We fit a logit model to data from this randomized controlled
experiment, although the model is likely false. The MLE is
$\hat\beta_n$. ITT and plug-in estimators are defined by (10)-(11).

The size of the population is $n$. This is increasing to infinity.
"Types" of subjects are defined by combinations of possible values
for the covariate, the response to control, and the response to
treatment. We assume that the fraction of subjects assigned to
treatment converges to a positive limit, along with the fraction in
each type. The parameters $\alpha^T$ and $\alpha^C$ converge too. This
may seem a little odd, but $\alpha^T$ and $\alpha^C$ may depend on the
study population, hence on $n$.

Theorem 1. Under the conditions of this section, if a logit model is
fitted to data from a randomized controlled experiment: (i) the MLE
$\hat\beta_n$ converges to a limit $\beta_\infty$; (ii) the plug-in
estimator $\tilde\alpha^T$, the ITT estimator $\hat\alpha^T$, and the
parameter $\alpha^T$ have a common limit; (iii) $\tilde\alpha^C$,
$\hat\alpha^C$, and $\alpha^C$ have a common limit; (iv)
$\tilde\Delta$, $\hat\Delta$, and $\Delta$ have a common limit.
Convergence of estimators holds almost surely, as the sample size
grows.

Estimating Individual-Level Parameters 

At the beginning of the paper, it was noted that the individual-level
parameters $Y_i^T$ and $Y_i^C$ are estimable. The proof is easy.
Recall that $X_i = 1$ if $i$ is assigned to treatment, and $X_i = 0$
otherwise; furthermore, $P(X_i = 1) = \pi_T$ is in (0, 1). Then
$Y_i X_i/\pi_T$ is an unbiased estimator for $Y_i^T$, and
$Y_i(1 - X_i)/(1 - \pi_T)$ is an unbiased estimator for $Y_i^C$, where
$Y_i = X_i Y_i^T + (1 - X_i) Y_i^C$ is the observed response.
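Unbiasedness can be confirmed exactly on a toy population by averaging each estimator over all equally likely treatment groups. This is a minimal sketch: the potential responses `y_T` and `y_C` and the sizes `n = 5`, `m = 2` are hypothetical, and exact rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction
from itertools import combinations

y_T = [1, 0, 1, 1, 0]  # hypothetical responses if treated
y_C = [0, 0, 1, 0, 1]  # hypothetical responses if assigned to control
n, m = 5, 2            # population size and number assigned to treatment
pi_T = Fraction(m, n)
groups = list(combinations(range(n), m))  # all equally likely treatment groups


def expectations(i):
    """Exact expected values of the two individual-level estimators for subject i."""
    est_T = est_C = Fraction(0)
    for g in groups:
        x = 1 if i in g else 0
        y = x * y_T[i] + (1 - x) * y_C[i]          # observed response
        est_T += Fraction(y * x) / pi_T
        est_C += Fraction(y * (1 - x)) / (1 - pi_T)
    return est_T / len(groups), est_C / len(groups)


for i in range(n):
    assert expectations(i) == (y_T[i], y_C[i])
```

Each estimator is unbiased but very noisy: for a single subject it takes only the values 0 and $1/\pi_T$ (or 0 and $1/(1 - \pi_T)$).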

9. AN INEQUALITY 

Let subject $i$ have probability of success $p_i$ if treated, $q_i$
if untreated, with $0 < q_i < 1$ and the $q_i$ not all equal. Suppose

$\frac{p_i}{1 - p_i} = \lambda \frac{q_i}{1 - q_i}$

for all $i$, where $\lambda > 1$. Thus,

$p_i = \frac{\lambda q_i}{1 + (\lambda - 1) q_i}$

and $0 < p_i < 1$. Let $p = \frac{1}{n} \sum_i p_i$ be the average
value of $p_i$, and likewise for $q$. We define the pooled multiplier
as

$\frac{p/(1 - p)}{q/(1 - q)}$.

The log of this quantity is analogous to the differential log odds in
(2).
The main object in this section is showing that

(29)  $\lambda$ is strictly larger than the pooled multiplier.

Russ Lyons suggested this elegant proof. Fix $\lambda > 1$. Let
$f(x) = x/(1 - x)$ for $0 < x < 1$. So $f$ is strictly increasing.
Let $h(x) = f^{-1}(\lambda f(x))$, so $p_i = h(q_i)$. Inequality (29)
says that $f(p) < \lambda f(q)$, that is, $p < h(q)$. Since
$p_i = h(q_i)$ and the $q_i$ are not all equal, proving (29) comes
down to proving that $h$ is strictly concave: Jensen's inequality
then gives $p < h(q)$. But

$h(x) = \frac{\lambda x}{1 + (\lambda - 1) x}
      = \frac{\lambda}{\lambda - 1} \Bigl( 1 - \frac{1}{1 + (\lambda - 1) x} \Bigr)$,

and $y \to 1/y$ is strictly convex for $y > 0$. This completes the
proof of (29).
In the other direction,

(30)  $\frac{p}{1 - p} - \frac{q}{1 - q} = \frac{p - q}{(1 - p)(1 - q)} > 0$

because $p_i > q_i$ for all $i$. So the pooled multiplier exceeds 1.
In short, given the assumptions of this section, pooling moves the
multiplier downward towards 1. Of course, if $\lambda < 1$, we could
simply interchange $p$ and $q$. The conclusion: pooling moves the
multiplier toward 1.
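A small numerical example makes the attenuation visible. The sketch below assumes a common multiplier of 3 and four hypothetical control-side success probabilities; the variable names are illustrative.

```python
lam = 3.0                    # hypothetical common odds multiplier, lam > 1
q = [0.1, 0.4, 0.7, 0.9]     # hypothetical success probabilities if untreated
# Success probabilities if treated, from p_i / (1 - p_i) = lam * q_i / (1 - q_i).
p = [lam * qi / (1 + (lam - 1) * qi) for qi in q]

p_bar = sum(p) / len(p)
q_bar = sum(q) / len(q)


def odds(x):
    """Odds x / (1 - x) for 0 < x < 1."""
    return x / (1 - x)


pooled = odds(p_bar) / odds(q_bar)
print(f"subject-level multiplier: {lam}, pooled multiplier: {pooled:.3f}")
assert 1 < pooled < lam   # (30) gives the left inequality, (29) the right
```

Every subject has odds multiplier 3, yet the multiplier computed from the averaged probabilities lands close to 2, strictly between 1 and 3.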

In this paper, we are interested in estimating differential log odds.
If the logit model (4) is right, the coefficient $\hat\beta_2$ of the
treatment indicator is a biased estimator of the differential log
odds $\Delta$ in (2), biased away from 0. That is what the
inequalities of this section demonstrate, the assumptions being
$\beta_3 \neq 0$, $Z_i$ is nonrandom, and $Z_i$ shows variation across
$i$. (Random $Z_i$ are easily accommodated.)

If the logit model is wrong, the inequalities show that
$\beta_2 > \Delta$ if $\Delta > 0$, while $\beta_2 < \Delta$ if
$\Delta < 0$. The assumptions are the same, with $\beta$ replaced by
$\beta_\infty$, attention being focused on the limiting values defined
in the previous section. Since the plug-in estimator $\tilde\Delta$ is
consistent, $\hat\beta_2$ must be inconsistent.

The pooling covered by (29)-(30) is a little different from the
collapsing discussed in Guo and Geng (1995). (i) Pooling does not
involve a joint distribution for $\{X_i, Z_i\}$, or a logit model
connecting $Y_i$ to $X_i$ and $Z_i$. (ii) Guo and Geng consider the
distribution of one triplet $\{Y_i, X_i, Z_i\}$ only, that is,
$n = 1$.